feat(devx): local dev setup for control plane and full end-to-end flow (MLI-6681)#823
lilyz-ai wants to merge 13 commits into
Conversation
Adds a one-command local development workflow for the model engine control plane so developers can iterate on gateway/service-builder code without building prod images or touching live infra.

- `docker-compose.local.yml`: spins up Postgres 15 + Redis 7
- `service_configs/service_config_local.yaml`: HMI config for local services
- `Makefile`: `dev-up` / `dev-migrate` / `dev-server` / `dev-down` / `test` targets
- `LOCAL=true` env var now activates fake queue/docker implementations (parallel to the existing `CIRCLECI=true` path) and skips the `GIT_TAG` requirement
- README: new "Control Plane Local Setup" section with a full walkthrough

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
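A minimal sketch of what an env-flag switch like this typically looks like (the function name and exact semantics are illustrative assumptions, not the PR's actual code):

```python
import os


def use_fake_backends() -> bool:
    # Sketch: LOCAL=true activates the same fake queue/docker
    # implementations that the existing CIRCLECI=true path already uses.
    return os.environ.get("LOCAL") == "true" or os.environ.get("CIRCLECI") == "true"
```

Keeping the new flag parallel to `CIRCLECI=true` means the fake-backend code path stays exercised by both CI and local dev.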
…G_PATH

- `service_config_local.yaml`: switch from `cache_redis_aws_url` to `cache_redis_onprem_url` so the Redis URL is resolved before the `cloud_provider` assertion fires — fixes a startup failure for non-AWS configs
- `Makefile`: pin `ML_INFRA_SERVICES_CONFIG_PATH` to `default.yaml` so local dev is not affected by a developer's ambient infra config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- README: add `ML_INFRA_SERVICES_CONFIG_PATH` to the manual env-var snippet so developers with non-AWS ambient configs don't accidentally hit the `cloud_provider` assertion
- `docker-compose.local.yml`: mount a named volume for Postgres so the database survives `dev-down`/`dev-up` cycles

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the manual until-loops in `dev-up` with `docker compose up --wait`, which blocks until healthchecks pass and exits non-zero if they fail, eliminating the infinite spin when a container crashes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
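For `--wait` to work, each service needs a healthcheck defined in the compose file. A sketch of how the named Postgres volume and those healthchecks might look (service names, credentials, and intervals are illustrative, not the actual `docker-compose.local.yml`):

```yaml
services:
  postgres:
    image: postgres:15
    ports: ["5432:5432"]
    volumes:
      - pgdata:/var/lib/postgresql/data   # named volume: survives dev-down/dev-up
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 2s
      timeout: 3s
      retries: 15
  redis:
    image: redis:7
    ports: ["6379:6379"]
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 2s
      timeout: 3s
      retries: 15

volumes:
  pgdata:
```

With healthchecks in place, `docker compose -f docker-compose.local.yml up -d --wait` returns only once both containers report healthy, or fails fast if one crashes.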
Extends the local dev setup so the complete control plane → Service Builder → k8s inference pod flow can be tested locally without cloud credentials. Changes:

- `local-full.yaml`: new onprem infra config pointing to localhost Redis/kind
- `dependencies.py`: `LOCAL=true` + `cloud_provider=onprem` falls through to the real Redis queue delegate instead of the fake (enabling the full k8s flow)
- `service_builder/celery.py`: fix onprem to use the redis backend, not s3
- `env_vars.py`: default `GIT_TAG` to `"local"` when `LOCAL=true` so k8s templates reference the correct `model-engine:local` image loaded into kind
- `Makefile`: `kind-up`/`kind-down`/`kind-image` targets + `dev-server-full`, `dev-service-builder`, `dev-k8s-cacher` targets using `FULL_LOCAL_ENV`
- README: full end-to-end setup section with step-by-step instructions, example endpoint creation, and a flow table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
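The `dependencies.py` wiring described above could be sketched like this (class names abbreviated and the function shape assumed for illustration):

```python
import os


class FakeQueueDelegate: ...      # control-plane-only mode: no real k8s
class OnPremQueueDelegate: ...    # real delegate: drives actual k8s via kind
class LiveQueueDelegate: ...      # production path


def pick_queue_delegate(cloud_provider: str):
    # Sketch of the fall-through: LOCAL=true normally means fakes, but the
    # onprem config opts into the real delegate to enable the full k8s flow.
    if os.environ.get("LOCAL") == "true":
        if cloud_provider == "onprem":
            return OnPremQueueDelegate()
        return FakeQueueDelegate()
    return LiveQueueDelegate()
```

The key design point is that the same `LOCAL=true` flag selects two different depths of realism depending on the infra config, so no new env var is needed for the full flow.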
The gateway's module-level `backend_protocol` had the same aws/gcp/azure-only mapping as `service_builder/celery.py`, with no onprem case. Without this fix, the Service Builder writes task results to Redis while the Gateway looks in S3, leaving endpoints stuck in PENDING under the kind-based full local flow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
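The shared mapping both components must agree on might look like this sketch (the gcp/azure backend values are placeholders; only the onprem-to-redis case is confirmed by the PR description):

```python
def get_backend_protocol(cloud_provider: str) -> str:
    """Both the Gateway and the Service Builder must use the same mapping,
    or task results get written to one store and read from another."""
    mapping = {
        "aws": "s3",
        "gcp": "gs",                 # placeholder value
        "azure": "azureblockblob",   # placeholder value
        "onprem": "redis",           # the fix: onprem results live in Redis
    }
    return mapping.get(cloud_provider, "redis")
```

Duplicating a mapping like this at module level in two files is exactly how the two halves drifted; extracting it into one shared function is the structural fix.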
The exporter package was imported unconditionally, while the `OTEL_AVAILABLE` flag only checked the base SDK, not the exporter. Moving the exporter import into the try block keeps `OTEL_AVAILABLE` False when the exporter is absent, fixing the ImportError that caused `run_unit_tests_server` to fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
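A sketch of the guard as merged (the `trace` import stands in for whatever core-SDK symbols the real module checks):

```python
# Sketch: both the core SDK and the exporter are probed in one guard, so a
# missing package on either side leaves OTEL_AVAILABLE = False.
try:
    from opentelemetry import trace  # noqa: F401  # core SDK check
    from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import (  # noqa: F401
        OTLPMetricExporter,
    )
    OTEL_AVAILABLE = True
except ImportError:
    OTEL_AVAILABLE = False
```

This fixes the crash, at the cost of coupling the flag to the exporter package — the trade-off the review comment below takes issue with.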
…trol-plane-local-devx
…chema gateway

- Reformat `correlation.py` and `celery.py` to satisfy black
- Move the `noqa` comment to the `from ... import (` line so ruff F401 is suppressed correctly
- Pass `schema_generator=GenerateJsonSchema()` (a newly required kwarg) to `get_definitions()` and `get_openapi_path()` in `live_model_endpoints_schema_gateway`, creating a fresh instance per route since pydantic rejects reuse

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
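The "pydantic rejects reuse" point can be demonstrated directly: in pydantic v2, a `GenerateJsonSchema` instance raises on its second `generate()` call, which is why a fresh instance per route is needed (the `Endpoint` model here is a made-up example):

```python
from pydantic import BaseModel
from pydantic.json_schema import GenerateJsonSchema


class Endpoint(BaseModel):
    name: str


# A GenerateJsonSchema instance is single-use: the second generate() call
# raises PydanticUserError, so callers must construct a new instance each time.
gen = GenerateJsonSchema()
first = gen.generate(Endpoint.__pydantic_core_schema__)

try:
    gen.generate(Endpoint.__pydantic_core_schema__)
    reuse_allowed = True
except Exception:
    reuse_allowed = False
```

Constructing `GenerateJsonSchema()` inside the per-route loop, rather than once at module level, sidesteps this restriction.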
…oes not have this param

The param was added to fix a local test failure (FastAPI 0.110.0 requires it), but FastAPI 0.135.1 (pinned in requirements.txt and used by CI) does not accept it, causing mypy call-arg errors. Revert to the original signature.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…xample Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```python
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import (  # noqa: F401
    OTLPMetricExporter,
)
```
OTLP exporter import tightens OTel availability requirement silently
OTLPMetricExporter is placed in the same try/except ImportError block as the core SDK availability check. This means any environment that has opentelemetry-api + opentelemetry-sdk installed but NOT opentelemetry-exporter-otlp-proto-grpc will now get OTEL_AVAILABLE = False and all trace correlation will silently be skipped. The exporter is only listed in vllm-specific requirements (inference/vllm/requirements.txt), not in the main requirements.txt, making this a fragile dependency for a shared utility. The import should be in its own nested try/except or removed entirely if OTLPMetricExporter isn't actually instantiated in this file.
Path: `model-engine/model_engine_server/common/startup_tracing/correlation.py`, lines 15-17
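The reviewer's suggested shape — a separate nested guard for the exporter — could be sketched as follows, so a missing `opentelemetry-exporter-otlp-proto-grpc` package no longer disables all trace correlation:

```python
# Core SDK availability is checked on its own...
try:
    from opentelemetry import trace  # noqa: F401
    OTEL_AVAILABLE = True
except ImportError:
    OTEL_AVAILABLE = False

# ...and the optional OTLP exporter gets an independent flag, so its absence
# cannot flip OTEL_AVAILABLE to False.
try:
    from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import (  # noqa: F401
        OTLPMetricExporter,
    )
    OTLP_EXPORTER_AVAILABLE = True
except ImportError:
    OTLP_EXPORTER_AVAILABLE = False
```

If `OTLPMetricExporter` is never instantiated in this file, the reviewer's alternative of deleting the import entirely is simpler still.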
Summary
Adds a complete local development workflow for model-engine so developers can iterate on both control plane code and the full endpoint lifecycle without cloud credentials or prod images.
Control-plane-only mode (`make dev-server`):

- `LOCAL=true` activates fake queue/docker/k8s implementations (mirrors `CIRCLECI=true`)

Full end-to-end mode (`make dev-server-full` + `make dev-service-builder` + `make dev-k8s-cacher`):

- `make kind-up` + `make kind-image` creates a local kind cluster and loads `model-engine:local` into it
- `model-engine:local` is used as the inference container — no GPU required

Code fixes included:

- `service_builder/celery.py` + `celery_task_queue_gateway.py`: the `onprem` cloud provider now uses the `redis` Celery backend instead of `s3` — without this, the Service Builder writes results to Redis but the Gateway looks in S3, leaving endpoints stuck in PENDING
- `dependencies.py`: `LOCAL=true` + `cloud_provider=onprem` falls through to the real `OnPremQueueEndpointResourceDelegate` instead of the fake
- `env_vars.py`: `GIT_TAG` defaults to `"local"` when `LOCAL=true` so k8s templates reference the correct `model-engine:local` image

New files:

- `docker-compose.local.yml` — Postgres 15 + Redis 7 with healthchecks and a persistent volume
- `service_configs/service_config_local.yaml` — HMI config for local services
- `model_engine_server/core/configs/local-full.yaml` — onprem infra config for kind
- `Makefile` — all dev targets in one place

Test plan

- `make dev-up && make dev-migrate && make dev-server` — gateway starts, `GET /v1/model-endpoints` returns 200
- `make kind-up && make kind-image` — kind cluster created, `model-engine:local` loaded
- `make dev-server-full` + `make dev-service-builder` + `make dev-k8s-cacher` — all three processes start cleanly
- `kubectl --context kind-llm-engine get pods -n model-engine` shows the pod, and the endpoint transitions to READY
- `make test`

Closes MLI-6681
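The `GIT_TAG` defaulting described above could look roughly like this (the helper name is hypothetical; only the default-to-`"local"`-under-`LOCAL=true` behavior comes from the PR):

```python
import os


def resolve_git_tag() -> str:
    # Sketch: GIT_TAG defaults to "local" under LOCAL=true so rendered k8s
    # templates reference the model-engine:local image loaded into kind.
    tag = os.environ.get("GIT_TAG")
    if tag:
        return tag
    if os.environ.get("LOCAL") == "true":
        return "local"
    raise RuntimeError("GIT_TAG must be set outside local dev")
```

An explicitly set `GIT_TAG` still wins, so the default never interferes with CI or production builds.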
🤖 Generated with Claude Code
Greptile Summary
- Adds `make dev-server` (control-plane only, fake k8s/queue) and `make dev-server-full` / `make dev-service-builder` / `make dev-k8s-cacher` (full kind-based end-to-end) workflows; backing services are Postgres 15 + Redis 7 via a new `docker-compose.local.yml`.
- Fixes the onprem Celery result backend: both `celery_task_queue_gateway.py` and `service_builder/celery.py` now correctly use `redis` as the Celery result backend for `cloud_provider == "onprem"`, and `dependencies.py` routes `LOCAL=true` to fake queue delegates for non-onprem configs while still using the real `OnPremQueueEndpointResourceDelegate` for the full local flow.
- `OTLPMetricExporter` is imported inside the shared SDK availability guard in `correlation.py`; this silently expands the OTel requirement to include `opentelemetry-exporter-otlp-proto-grpc`, which is only listed in vllm-specific requirements and not in the main `requirements.txt`.
Safe to merge for the intended local-dev purpose; one P1 in correlation.py could silently disable tracing in non-standard environments
Core bug fixes (celery backend protocol, dependency wiring) are correct and consistent. The P1 in correlation.py is isolated to environments that have opentelemetry-sdk without the OTLP exporter, which is non-standard given the existing requirements layout, so real-world impact is low but the architectural concern is valid.
model-engine/model_engine_server/common/startup_tracing/correlation.py — the OTLPMetricExporter guard import
Important Files Changed
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph Control-plane-only ["Control-plane-only (make dev-server)"]
        A[LOCAL=true\ncloud_provider=aws] --> B[FakeQueueDelegate]
        A --> C[Redis TaskQueueGateway\nlocalhost:6379]
        A --> D[FakeDockerRepository]
    end
    subgraph Full["Full end-to-end (make dev-server-full + builder + cacher)"]
        E[LOCAL=true\ncloud_provider=onprem] --> F[OnPremQueueDelegate]
        E --> G[Redis TaskQueueGateway\nlocalhost:6379]
        G -->|Celery task| H[Service Builder\nredis broker + redis backend]
        H -->|k8s Deployment| I[kind cluster\nmodel-engine:local]
        I -->|status| J[K8s Cacher\nwrites to Redis]
        J --> G
    end
    subgraph Infra
        K[(Postgres\nlocalhost:5432)]
        L[(Redis\nlocalhost:6379)]
    end
    C --- L
    G --- L
    A --- K
    E --- K
```
Reviews (10): Last reviewed commit: "Merge branch 'main' into lilyz-ai/mli-66..."